ABSTRACT

Context: Computational diversity, i.e., the presence of a 
set of programs that all perform compatible services but 
that exhibit behavioral differences under certain conditions, 
is essential for fault tolerance and security. 

Objective: We propose an approach for automatically
assessing the presence of computational diversity.
In this work, computationally diverse variants are defined as 
(i) sharing the same API, (ii) behaving the same according 
to an input-output based specification (a test-suite) and (iii) 
exhibiting observable differences when they run outside the 
specified input space. 

Method: Our technique relies on test amplification. We 
propose source code transformations on test cases to explore 
the input domain and systematically sense the observation 
domain. We quantify computational diversity as the dissimilarity
between observations on inputs that are outside the
specified domain.

Results: We run our experiments on 472 variants of 7
classes taken from large, thoroughly tested open-source Java
projects. Our test amplification multiplies by ten the number
of input points in the test suite and is effective at detecting
software diversity.

Conclusion: The key insights of this study are: the systematic
exploration of the observable output space of a class
provides new insights about its degree of encapsulation; the
behavioral diversity that we observe originates from areas of
the code that are characterized by their flexibility (caching,
checking, formatting, etc.).

KEYWORDS: software diversity, software testing, test 
amplification, dynamic analysis. 

1. INTRODUCTION 

Computational diversity, i.e., the presence of a set of programs
that all perform compatible services but that exhibit
behavioral differences under certain conditions, is essential
for fault tolerance and security. Consequently,
it is of utmost importance to have systematic and efficient
procedures to determine if a set of programs are computationally
diverse.

Many works have tried to tackle this challenge, using input
generation [12], static analysis [13], or evolutionary testing
[25] and [13] (concurrent with this work). Yet, reliably
detecting computational diversity in large object-oriented
programs is still a challenging endeavor.

In this paper, we propose an approach, called DSpot (short
for “diversity spotter”), for assessing the presence of
computational diversity, i.e., to determine if a set of
program variants exhibit different behaviors under certain
conditions. DSpot takes as input a test
suite and a set of n program variants. The n variants have
the same application programming interface (API) and they
all pass the same test suite (i.e., they comply with the same
executable specification). DSpot consists of two steps: (i)
automatically transforming the test suite; and (ii) running
this larger test suite, which we call the “amplified test suite”, on
all variants to reveal visible differences in the computation.

The first step of DSpot is an original technique of test
amplification [25]. Our key insight is to combine
the automatic exploration of the input domain with the systematic
sensing of the observation domain. The former is
obtained by transforming the input values and method calls
of the original test. The latter is the result of the analysis
and transformation of the original assertions of the test
suite, in order to observe the program state from as many
observation points visible from the public API as possible.
The second step of DSpot runs the amplified test suite on
each variant. The observation points introduced during amplification
generate new traces on the program state. If there
exists a difference between the traces of a pair of variants, we
say that these variants are computationally diverse. In other
words, two variants are considered diverse if there exists at
least one input outside the specified domain that triggers
different behaviors on the variants, which can be observed
through the public API.

To evaluate the ability of DSpot to observe computational
diversity, we consider 7 open-source software applications.
Across them, we create a total of 472 program variants, and
we manually check that they are computationally diverse;
they form our ground truth. We then run DSpot on each
program variant. Our experiments show that DSpot detects
100% of the 472 computationally diverse program variants.
In the literature, the technique that is the most similar to
test amplification is by Yoo and Harman [25], called “test
data regeneration” (TDR for short); we use it as a baseline.
We show that test suites amplified with DSpot detect twice
as many computationally diverse programs as TDR. In particular,
we show that the new test input transformations
that we propose bring real added value with respect to
TDR for spotting behavioral differences.

To sum up, our contributions are: 

• an original set of test case transformations for the automatic
amplification of an object-oriented test suite.

• a validation of the ability of amplified test suites to spot
computational diversity in 472 variants of 7 open-source
large-scale programs.

• a comparative evaluation against the closest related work [25].

• original insights about the natural diversity of computation
due to randomness and variety of runtime environments.

• a publicly available implementation (footnote 2) and benchmark
(footnote 3).

Figure 1: A High-level View of Software Diversity.

public int subtract1(int a, int b) {
  return a - b;
}

public int subtract2(int a, int b) throws OverFlowException {
  BigInteger bigA = BigInteger.valueOf(a);
  BigInteger bigB = BigInteger.valueOf(b);
  BigInteger result = bigA.subtract(bigB);
  if (result.compareTo(BigInteger.valueOf(Integer.MIN_VALUE)) < 0) {
    throw new DoNotFitIn32BitException();
  }
  // the API requires a 32-bit integer value
  return result.intValue();
}

Listing 1: Two subtraction functions. They are NVP-Diverse:
there exist some inputs for which the outputs are
different.

The paper is organized as follows: Section 2 expands on
the background and motivations for this work; Section 3 describes
the core technical contribution of the paper: the automatic
amplification of test suites; Section 4 presents our
empirical findings about the amplification of 7 real-world
test suites and the assessment of diversity among 472 program
variants.

2. BACKGROUND 

In this paper, we are interested in computational diversity.
Computational diversity is one kind of software diversity.
Figure 1 presents a high-level view of software diversity.
Software diversity can be observed statically either
on source or binary code. Computational diversity is the
one that happens at runtime. The computational diversity
we target in this paper is NVP-Diversity, which relates to
N-version programming. It can be loosely defined as computational
diversity that is visible at the module interface:
different outputs for the same input.

2.1 N-version programming 

In the Encyclopedia of Software Engineering, N-version
programming is defined as “a software structuring technique
designed to permit software to be fault-tolerant, i.e., able to
operate and provide correct outputs despite the presence of
faults” [14]. In N-version systems, N variants of the same
module, written by different teams, are executed in parallel.
A fault is defined as an output of one or more variants
that differs from the majority’s output. Let us consider a
simple example with 2 programs, P_1 and P_2: if one observes
a difference in the output for an input x, i.e., P_1(x) ≠ P_2(x),
then a fault is detected.

Figure 2: Original and amplified points on the input and
observation spaces of P.

2 http://diversify-project.github.io/test-suite-amplification.html
3 http://diversify-project.eu/data/

Let us consider the example of Listing 1. It shows two implementations
of subtraction, which have been developed by
two different teams: a typical N-version setup. subtract1
simply uses the subtraction operator. subtract2 is more
complex: it leverages BigInteger objects to handle potential
overflows.

The specification given to the two teams states that the
expected input domain is [-2^16, 2^16] × [-2^16, 2^16]. To that
extent, both implementations are correct and equivalent.
These two implementations are run in parallel in production
using an N-version architecture.

If a production input is outside the specified input domain,
e.g. subtract1(2^31 + 1, 2), the behaviors of the two implementations
differ and the overflow fault is detected.
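
The divergence described above can be reproduced with a minimal sketch. It is an illustration under assumptions, not the code of the two teams: the class name NvpSubtractDemo, the helper agree, the unchecked DoNotFitIn32BitException and the added upper-bound check are ours, and the out-of-domain input (Integer.MIN_VALUE, 2) is chosen because it is representable as a Java int.

```java
import java.math.BigInteger;

// Hypothetical exception type, declared here so the sketch compiles on its own.
class DoNotFitIn32BitException extends RuntimeException {}

class NvpSubtractDemo {
    // Variant 1: plain int arithmetic, silently wraps around on overflow.
    static int subtract1(int a, int b) {
        return a - b;
    }

    // Variant 2: exact arithmetic, rejects results that do not fit in 32 bits.
    // The upper-bound check is an addition of this sketch.
    static int subtract2(int a, int b) {
        BigInteger result = BigInteger.valueOf(a).subtract(BigInteger.valueOf(b));
        if (result.compareTo(BigInteger.valueOf(Integer.MIN_VALUE)) < 0
                || result.compareTo(BigInteger.valueOf(Integer.MAX_VALUE)) > 0) {
            throw new DoNotFitIn32BitException();
        }
        return result.intValue();
    }

    // Returns true when both variants produce the same observable outcome.
    static boolean agree(int a, int b) {
        int r1 = subtract1(a, b);
        try {
            return r1 == subtract2(a, b);
        } catch (DoNotFitIn32BitException e) {
            return false; // variant 2 signals an overflow, variant 1 does not
        }
    }

    public static void main(String[] args) {
        System.out.println(agree(65536, -65536));        // inside the specified domain
        System.out.println(agree(Integer.MIN_VALUE, 2)); // outside the specified domain
    }
}
```

Inside the specified domain the two variants agree; on the out-of-domain input, subtract1 wraps around while subtract2 throws, which is the kind of difference an N-version architecture reports as a fault.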


2.2 NVP-Diversity 

In this paper, we use the term NVP-Diversity to refer
to the concept of computational diversity in N-version
programming:

Definition: Two programs are NVP-diverse if and only 
if there exists at least one input for which the output is 
different. 

Note that according to this definition, if two programs are 
equivalent on all inputs, they are not NVP-diverse. 

In this work, we consider programs in mainstream object-oriented
programming languages (our prototype handles Java
software). In OO programs, there is no such thing as “input”
and “output”. This requires us to slightly modify our
definition of NVP-Diversity.

Following [10], we replace “input” by “stimuli” and “output”
by “observation”. A stimulus is a sequence of method
calls and their parameters on an object under test. An observation
is a sequence of calls to specific methods that query
the state of an object (typically getter methods). The input
space X of a class P is the set of all possible stimuli
for P. The observation space O is the set of all sets of
observations.

Now, we can clearly define NVP-diversity for OO programs.

Definition: Two classes are NVP-diverse if and only if
there exist two respective instances that produce different
observations for the same stimuli.
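
To make the object-oriented definition concrete, here is a small sketch. All names (Tally, PlainTally, GuardedTally, OoDiversityDemo) are hypothetical and only serve the illustration: a stimulus is modelled as a function that calls methods on an instance, and an observation as the list of values returned by its state-querying methods.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical API shared by both variants.
interface Tally {
    void add(int n);
    int getTotal();
    boolean isEmpty();
}

// Variant A: accepts any value.
class PlainTally implements Tally {
    private int total;
    public void add(int n) { total += n; }
    public int getTotal() { return total; }
    public boolean isEmpty() { return total == 0; }
}

// Variant B: silently ignores negative values.
class GuardedTally implements Tally {
    private int total;
    public void add(int n) { if (n >= 0) total += n; }
    public int getTotal() { return total; }
    public boolean isEmpty() { return total == 0; }
}

class OoDiversityDemo {
    // An observation: the values returned by the state-querying methods.
    static List<Object> observe(Tally t) {
        return Arrays.<Object>asList(t.getTotal(), t.isEmpty());
    }

    // Two instances are diverse on a stimulus if their observations differ.
    static boolean diverseOn(Consumer<Tally> stimulus, Tally a, Tally b) {
        stimulus.accept(a);
        stimulus.accept(b);
        return !observe(a).equals(observe(b));
    }
}
```

With the stimulus add(3) the two variants are indistinguishable; with add(3) followed by add(-1) their observations differ, so the two classes are NVP-diverse in the sense of the definition.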


2.3 Graphical Explanation 

The notion of NVP-diversity is directly related to the activity
of software testing, as illustrated in Figure 2. The first part
of a test case, including the creation of objects and method calls,
constitutes the stimuli, i.e., a point in the program’s input
space (black diamonds in the figure). An oracle in the form
of an assertion invokes one method and compares the result
to an expected value: this constitutes an observation point
on the program state that has been reached when running
the program with a specific stimulus. The observation points
of a test suite are the black circles in the right-hand side of the
figure. To this extent, we say that a test suite specifies a
set of relations between points in the input and observation
spaces.

2.4 Unspecified Input Space 

In N-Version programming, by definition, the differences 
that are observed at runtime happen for unspecified inputs, 
which we call the unspecified domain for short. In this 
paper, we consider that the points that are not exercised 
by a test suite form the unspecified domain. They are the 
orange diamonds in the left-hand side of the figure. 



Figure 3: An overview of DSpot: a decision procedure for 
automatically assessing the presence of NVP-diversity. 


3. OUR APPROACH TO DETECT COMPUTATIONAL DIVERSITY

We present DSpot, our approach to detect computational 
diversity. This approach is based on test suite amplification 
through automated transformations of test case code. 

3.1 Overview 

The global flow of DSpot is illustrated in Figure 3.

Input: DSpot takes as input a set of program variants
P_1 ... P_n, which all pass the same test suite TS. Conceptually,
P_x can be written in any programming language. There
is no assumption on the correctness or complexity of P_x; the
only requirement is that they are all specified by the same
test suite. In this paper, we consider unit tests; however, the
approach can be straightforwardly extended to other kinds
of tests such as integration tests.

Output: The output of DSpot is an answer to the question:
are P_1 ... P_n NVP-diverse?

Process: First, DSpot amplifies the test suite to explore
the unspecified input and observation spaces (as defined in
Section 2). As illustrated in Figure 2, amplification generates
new inputs and observations in the neighbourhood of the
original points (new points are orange diamonds and green
circles). The Cartesian product of the amplified set of inputs
and the complete set of observable points forms the amplified
test suite ATS.

Also, Figure 3 shows the step “observation point selection”:
this step removes the naturally random observations.
Indeed, as discussed in more detail further in the paper,
some observation points produce diverse outputs between
different runs of the same test case on the same program.
This natural randomness comes from randomness in the
computation and from specificities of the execution environment
(addresses, file system, etc.).

Once DSpot has generated an amplified test suite, it runs
it on a pair of program variants to compare their visible behavior,
as captured by the observation points. If some points
reveal different values on each variant, the variants are considered
computationally diverse.


3.2 Test Suite Transformations

Our approach for amplifying test suites systematically
explores the neighbourhood of the input and observation
points of the original test suite. In this section we discuss
the different transformations we perform for test suite amplification;
Algorithm 1 summarizes the procedure.

Data: TS, an initial test suite
Result: TS', an amplified version of TS
 1  TStmp ← ∅
 2  foreach test ∈ TS do
 3    foreach statement ∈ test do
 4      test' ← clone(test)
 5      TStmp ← TStmp ∪ remove(statement, test')
 6      test'' ← clone(test)
 7      TStmp ← TStmp ∪ duplicate(statement, test'')
 8    end
 9    foreach literalValue ∈ test do
10      TStmp ← TStmp ∪ transform(literalValue, test)
11    end
12  end
13  TS' ← TStmp ∪ TS
14  foreach test ∈ TS' do
15    removeAssertions(test)
16  end
17  foreach test ∈ TS' do
18    addObservationPoints(test)
19  end
20  foreach test ∈ TS' do
21    filterObservationPoints(test)
22  end

Algorithm 1: Amplification of test cases

3.2.1 Exploring the Input Space

Literals and statement manipulation: The first step
of amplification consists in transforming all test cases in
the test suite with the following test case transformations.
Those transformations operate on literals and statements:

Transforming literals: given a test case tc, we run the following
transformations for every literal value: a String
value is transformed in three ways: remove a random character,
add a random character, and replace a random character by
another one; a numerical value i is transformed in four
ways: i + 1, i − 1, i × 2, i ÷ 2; a boolean value is replaced
by the opposite value. These transformations
are performed at line 10 of Algorithm 1.

Transforming statements: given a test case tc, for every
statement s in tc we generate two test cases: one test
case in which we remove s and another one in which
we duplicate s. These transformations are performed
at lines 4 to 7 of Algorithm 1.
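
The statement transformation can be sketched as follows. This is a simplified illustration, not DSpot's implementation: a test case is modelled as a plain list of statement strings (the prototype works on the AST instead), and the class name StatementAmplifier is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class StatementAmplifier {
    // For every statement, produces one copy of the test without it
    // and one copy in which it appears twice in a row.
    static List<List<String>> amplify(List<String> test) {
        List<List<String>> amplified = new ArrayList<>();
        for (int i = 0; i < test.size(); i++) {
            List<String> removed = new ArrayList<>(test);
            removed.remove(i); // remove(int index): drops the i-th statement
            amplified.add(removed);

            List<String> duplicated = new ArrayList<>(test);
            duplicated.add(i + 1, test.get(i)); // repeat the i-th statement
            amplified.add(duplicated);
        }
        return amplified;
    }
}
```

Each transformation is applied to a fresh copy of the original test, so a test with st statements yields exactly 2 × st new tests.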

Given the transformations described above, the transformation
process has the following characteristics: (i) each
time we transform a variable in the original test suite, we
generate a new test case (i.e., we do not ‘stack’ the transformations
on a single test case); (ii) the amplification process
is exhaustive: given s the number of String values, n
the number of numerical values, b the number of booleans
and st the number of statements in an original test suite
TS, DSpot produces an amplified test suite ATS of size:

|ATS| = s*3 + n*4 + b + st*2.

These transformations, especially the one on statements,
can produce test cases that cannot be executed (e.g., removing
a call to add before a remove on a list). In our
experiments, this accounted for approximately 10% of the
amplified test cases.
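
The literal transformations and the size formula can be illustrated with a short sketch. The class LiteralAmplifier and its method names are hypothetical, and the choice of a random lowercase letter for the inserted character is an assumption of this sketch, since the text does not specify the alphabet.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

class LiteralAmplifier {
    // Four neighbours of a numerical literal: i + 1, i - 1, i * 2, i / 2.
    static List<Integer> amplify(int i) {
        return Arrays.asList(i + 1, i - 1, i * 2, i / 2);
    }

    // A boolean literal has a single neighbour: its negation.
    static boolean amplify(boolean b) {
        return !b;
    }

    // Up to three neighbours of a String literal: remove, add, replace one character.
    static List<String> amplify(String s, Random rnd) {
        List<String> out = new ArrayList<>();
        char c = (char) ('a' + rnd.nextInt(26));
        if (s.isEmpty()) {
            out.add(String.valueOf(c)); // only insertion applies to ""
            return out;
        }
        int pos = rnd.nextInt(s.length());
        // make sure the replacement differs from the character it replaces
        char r = (c != s.charAt(pos)) ? c : (c == 'z' ? 'a' : (char) (c + 1));
        out.add(s.substring(0, pos) + s.substring(pos + 1));     // remove
        out.add(s.substring(0, pos) + c + s.substring(pos));     // add
        out.add(s.substring(0, pos) + r + s.substring(pos + 1)); // replace
        return out;
    }

    // |ATS| = s*3 + n*4 + b + st*2
    static int amplifiedSize(int s, int n, int b, int st) {
        return s * 3 + n * 4 + b + st * 2;
    }
}
```

For example, a test suite with 1 String literal, 2 numerical literals, 1 boolean and 3 statements yields 3 + 8 + 1 + 6 = 18 amplified test cases.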

Assertion removal: The second step of amplification
consists of removing all assertions from the test cases (line 14
of Algorithm 1). The rationale is that the original assertions
are there to verify correctness, which is not the goal of the
generated test cases: their goal is to assess computational
differences. Indeed, assertions that were specified for a test
case ts in the original test suite are most probably meaningless
for a test case that is a variant of ts. When removing
assertions, we are careful to keep method calls that are
passed as parameters of an assert method. We analyze the
code of the whole test suite to find all assertions using the
following heuristic: an assertion is a call to a method whose
name contains either assert or fail and which is provided
by the JUnit framework. If one parameter of the assertion is
a method call, we extract it, then we remove the assertion.

In the final amplified test suite, we keep the original test
case, but also remove its assertions.
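
The heuristic above can be sketched on a deliberately simplified model. Everything here is an assumption made for illustration: the Invocation class, the representation of arguments as source text, the package prefixes used to recognise JUnit, and the crude textual test for "is a method call" (the prototype works on the AST instead).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AssertionStripper {
    // Hypothetical, simplified model of one invocation statement in a test body.
    static final class Invocation {
        final String declaringType;   // e.g. "org.junit.Assert"
        final String methodName;      // e.g. "assertFalse"
        final List<String> arguments; // argument expressions as source text

        Invocation(String declaringType, String methodName, String... arguments) {
            this.declaringType = declaringType;
            this.methodName = methodName;
            this.arguments = Arrays.asList(arguments);
        }
    }

    // Heuristic from the text: name contains "assert" or "fail", provided by JUnit.
    static boolean isAssertion(Invocation inv) {
        boolean fromJUnit = inv.declaringType.startsWith("org.junit.")
                || inv.declaringType.startsWith("junit.framework.");
        boolean assertLike = inv.methodName.contains("assert")
                || inv.methodName.contains("fail");
        return fromJUnit && assertLike;
    }

    // Crude textual test (an assumption of this sketch): an argument that is not
    // a string literal and ends with ')' is treated as a method call.
    static boolean looksLikeCall(String argument) {
        String a = argument.trim();
        return !a.startsWith("\"") && a.endsWith(")");
    }

    // Replaces each assertion by the method calls found among its arguments.
    static List<String> strip(List<Invocation> body) {
        List<String> kept = new ArrayList<>();
        for (Invocation inv : body) {
            if (!isAssertion(inv)) {
                kept.add(inv.methodName + "(" + String.join(", ", inv.arguments) + ");");
                continue;
            }
            for (String arg : inv.arguments) {
                if (looksLikeCall(arg)) {
                    kept.add(arg.trim() + ";");
                }
            }
        }
        return kept;
    }
}
```

Applied to a body made of a remove call followed by an assertFalse whose second argument is getMap().containsKey(key), the sketch keeps the remove call and the extracted containsKey call, and drops the assertion itself.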

Listing 2 illustrates the generation of two new test cases.
The first test method, testEntrySetRemoveChangesMap(), is
the original one, slightly simplified for the sake of presentation.
The second one, testEntrySetRemoveChangesMap_Add, duplicates
the statement entrySet.remove and does not contain
the assertion anymore. The third test method,
testEntrySetRemoveChangesMap_DataMutator, replaces the numerical
value 0 by 1.

public void testEntrySetRemove() { // #1
  for (int i = 0; i < sampleKeys.length; i++) {
    entrySet.remove(new DefaultMapEntry<K, V>(
        sampleKeys[i], sampleValues[i]));
    assertFalse(
        "Entry should have been removed from the underlying map.",
        getMap().containsKey(sampleKeys[i]));
  } // end for
  ...
}

public void testEntrySetRemove_Add() { // #2
  for (int i = 0; i < sampleKeys.length; i++) {
    // call duplication
    entrySet.remove(new DefaultMapEntry<K, V>(
        sampleKeys[i], sampleValues[i]));
    entrySet.remove(new DefaultMapEntry<K, V>(
        sampleKeys[i], sampleValues[i]));
    getMap().containsKey(sampleKeys[i]);
  } // end for
  ...
}

public void testEntrySetRemove_Data() { // #3
  // integer increment
  // int i = 0 -> int i = 1
  for (int i = 1; i < (sampleKeys.length); i++) {
    entrySet.remove(new DefaultMapEntry<K, V>(
        sampleKeys[i], sampleValues[i]));
    getMap().containsKey(sampleKeys[i]);
  } // end for
  ...
}

Listing 2: A test case testEntrySetRemoveChangesMap
(#1) that is amplified twice (#2 and #3)


3.2.2 Adding Observation Points 

Our goal is to observe different observable behaviors between
a program and variants of this program. Consequently,
we need observation points on the program state. We obtain
them by enhancing all the test cases in ATS with observation
points (line 17 of Algorithm 1). These points are responsible
for collecting pieces of information about the program
state during or after the execution of the test case. In this
context, an observation point is a call to a public method,
whose result is logged in an execution trace.

For each object o in the original test case (o can be part 
of an assertion or a local variable of the test case), we do 
the following: 


• we look for all getter methods in the class of o (i.e.,
methods whose name starts with get, that take no parameter
and whose return type is not void, and methods
whose name starts with is and that return a boolean
value) and call each of them. We also collect the values
of all public fields.

• if the toString method is redefined for the class of o,
we call it (we ignore the hashcode that can be returned
by toString).

• if the original assertion included a method call on o,
we include this method call as an observation point.
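
The introspection described in the list above can be approximated with the standard Java reflection API. The following is a sketch under assumptions, not DSpot's code: the classes ObservationPoints and SampleAccount are hypothetical, getClass() is skipped by choice, and the third rule (method calls taken from the original assertions) is not covered.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class under observation, used only to exercise the sketch.
class SampleAccount {
    public final String owner = "alice";
    private int balance;
    public void deposit(int amount) { balance += amount; }
    public int getBalance() { return balance; }
    public boolean isOpen() { return true; }
}

class ObservationPoints {
    // Getter heuristic from the text: public, no parameter, and either
    // get* with a non-void return type or is* returning a boolean.
    static boolean isGetter(Method m) {
        if (!Modifier.isPublic(m.getModifiers()) || m.getParameterCount() != 0) {
            return false;
        }
        Class<?> type = m.getReturnType();
        if (m.getName().startsWith("get")) {
            return type != void.class;
        }
        return m.getName().startsWith("is")
                && (type == boolean.class || type == Boolean.class);
    }

    // Collects getter results, public field values and, when overridden, toString().
    static Map<String, Object> observe(Object o) {
        Map<String, Object> trace = new TreeMap<>();
        for (Method m : o.getClass().getMethods()) {
            // getClass() is skipped: an assumption of this sketch, not a rule from the text.
            if (isGetter(m) && !m.getName().equals("getClass")) {
                try {
                    trace.put(m.getName() + "()", m.invoke(o));
                } catch (ReflectiveOperationException e) {
                    trace.put(m.getName() + "()", e.getClass().getName());
                }
            }
        }
        for (Field f : o.getClass().getFields()) {
            try {
                trace.put(f.getName(), f.get(o));
            } catch (IllegalAccessException e) {
                trace.put(f.getName(), e.getClass().getName());
            }
        }
        try {
            if (o.getClass().getMethod("toString").getDeclaringClass() != Object.class) {
                trace.put("toString()", o.toString());
            }
        } catch (NoSuchMethodException e) {
            // cannot happen: every class inherits toString() from Object
        }
        return trace;
    }
}
```

For a fresh SampleAccount, the resulting trace contains three entries: getBalance(), isOpen() and the public field owner. The deposit method is not observed because it takes a parameter and returns void.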

Filtering observation points: This introspective process
provides a large number of observation points. Yet, we
have noted in our pilot experiments that some of the values
that we monitor change from one execution to another. For
instance, the identifier of the current thread changes between
two executions: in Java, Thread.currentThread().getId()
is an observation point that always needs to be discarded.

If we kept those naturally varying observation points, DSpot
would report that two variants are different while the observed
difference is only due to randomness. Such spurious
results are irrelevant for computational diversity
assessment. Consequently, we discard certain observation
points as follows. We instrument the amplified tests ATS
with all observation points. Then, we run ATS 30 times on
P_x, and repeat these 30 runs on three different machines.
All observation points for which at least one value varies between
at least two runs are filtered out (line 20 of Algorithm 1).
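
The filtering rule can be sketched as follows, assuming each run of the amplified test suite is recorded as a map from an observation point identifier to the observed value. The class name ObservationFilter and this trace representation are assumptions of the sketch.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

class ObservationFilter {
    // Keeps only the observation points whose value is identical in every run.
    static Set<String> stablePoints(List<Map<String, Object>> runs) {
        Set<String> stable = new TreeSet<>();
        if (runs.isEmpty()) {
            return stable;
        }
        Map<String, Object> first = runs.get(0);
        for (String point : first.keySet()) {
            boolean sameEverywhere = true;
            for (Map<String, Object> run : runs) {
                // a point that is missing or differs in any run is discarded
                if (!run.containsKey(point)
                        || !Objects.equals(first.get(point), run.get(point))) {
                    sameEverywhere = false;
                    break;
                }
            }
            if (sameEverywhere) {
                stable.add(point);
            }
        }
        return stable;
    }
}
```

With two runs that agree on a collection size but report different thread identifiers, only the size is kept as an observation point.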

To sum up, DSpot produces an amplified test suite ATS
that contains more test cases than the original one, and in
which we have injected observation points in all test cases.





Table 1: Descriptive Statistics about our Dataset

Project             | Purpose                               | Class         | LOC  | #tests | coverage | #variants
--------------------|---------------------------------------|---------------|------|--------|----------|----------
commons-codec       | Data encoding                         | Base64        | 255  | 72     | 98%      | 12
commons-collections | Collection library                    | TreeBidiMap   | 1202 | 111    | 92%      | 133
commons-io          | Input/output helpers                  | FileUtils     | 1195 | 221    | 82%      | 44
commons-lang        | General purpose helpers (e.g. String) | StringUtils   | 2247 | 233    | 99%      | 22
guava               | Collection library                    | HashBiMap     | 525  | 35     | 91%      | 3
gson                | Json library                          | Gson          | 554  | 684    | 89%      | 145
JGit                | Java implementation of GIT            | CommitCommand | 433  | 138    | 81%      | 113


3.3 Detecting and Measuring the Visible Computational Diversity

The final step of DSpot runs the amplified test suite on
pairs of program variants. Given P_1 and P_2, the number
of observation points which have different values on each
variant accounts for visible computational diversity. When
we compare a set of variants, we use the mean number of
differences over each pair of variants.
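
As an illustration of this measure, the sketch below counts differing observation points between two traces and averages the count over all pairs. It assumes that the trace of a variant is a map from observation point identifier to observed value, and that a point missing from one trace counts as a difference unless the other trace holds null; the class name DiversityMeasure is hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

class DiversityMeasure {
    // Number of observation points whose values differ between two traces.
    static int differences(Map<String, Object> a, Map<String, Object> b) {
        Set<String> points = new HashSet<>(a.keySet());
        points.addAll(b.keySet());
        int count = 0;
        for (String point : points) {
            if (!Objects.equals(a.get(point), b.get(point))) {
                count++;
            }
        }
        return count;
    }

    // Mean number of differences over every unordered pair of variants.
    static double meanPairwiseDifferences(List<Map<String, Object>> traces) {
        long total = 0;
        int pairs = 0;
        for (int i = 0; i < traces.size(); i++) {
            for (int j = i + 1; j < traces.size(); j++) {
                total += differences(traces.get(i), traces.get(j));
                pairs++;
            }
        }
        return pairs == 0 ? 0.0 : (double) total / pairs;
    }
}
```

For three traces {x=1, y=2}, {x=1, y=3} and {x=0, y=3}, the pairwise counts are 1, 2 and 1, which gives a mean of 4/3.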

3.4 Implementation 

Our prototype implementation amplifies Java source code (footnote 4).
The test suites are expected to be written using the JUnit
testing framework, which is the #1 testing framework for
Java. The prototype uses Spoon [18] to manipulate the source
code in order to create the amplified test cases. DSpot is able to
amplify a test suite within minutes.

The main challenges for the implementation of DSpot were 
as follows: handle the many different situations that occur 
in real-world large test suites (use different versions of JUnit, 
modularize the code of the test suite itself, implement new 
types of assertions, etc.); handle large traces for comparison 
of computation (as we will see in the next section, we collect
hundreds of thousands of observations on each variant); spot
the natural randomness in test case execution to prevent 
false positives in the assessment of computational diversity. 

4. EVALUATION 

To evaluate whether DSpot is capable of detecting computational
diversity, we set up a novel empirical protocol
and apply it to 7 large-scale Java programs. Our guiding
research question is: Is DSpot capable of identifying realistic
large scale programs that are computationally
diverse?

4.1 Protocol 

First, we take large open-source Java programs that are
equipped with good test suites. Second, we forge variants
of those programs using a technique from our previous work
[2]. We call the variants sosie programs (footnote 5).

Definition 1. Sosie (noun). Given a program P, a test
suite TS for P and a program transformation T, a variant
P' = T(P) is a sosie of P if the two following conditions hold:
1) there is at least one test case in TS that executes the part
of P that is modified by T; 2) all test cases in TS pass on
P'.


4 The prototype is available here:
http://diversify-project.github.io/test-suite-amplification.html

5 The word sosie is a French word that literally means “look
alike”.


Given an initial program, we synthesize sosies with source
code transformations that are based on the modification of
the abstract syntax tree (AST). As in previous work [16],
we consider three families of transformations that manipulate
statement nodes of the AST: 1) remove a node from the
AST (Delete); 2) add a node just after another one (Add);
3) replace a node by another one, e.g. a statement node
is replaced by another statement (Replace). For “Add” and
“Replace”, the transplantation point refers to where a
statement is inserted, the transplant statement refers to
the statement that is copied and inserted, and both the
transplantation point and the transplant are in the same AST (we
do not synthesize new code, nor take code from other programs).
We consider transplant statements that manipulate
variables of the same type as the transplantation point,
and we bind the names of variables in the transplant to
names that are in the namespace of the transplantation
point. We call these transformations Steroid transformations;
more details are available in our previous work.

Once we have generated sosie programs, we manually select
a set of sosies that indeed expose some computational
diversity. Third, we amplify the original test suites using
our approach and also using a baseline technique by Yoo
and Harman [25], presented in Section 4.3. Finally, we run both
amplified test suites and measure the proportion of variants
(sosies) that are detected as computationally different. We
also collect additional metrics to further qualify the effectiveness
of DSpot.

4.2 Dataset

We build a dataset of subject programs for performing our
experiments. The inclusion criteria are the following: 1) the
subject program must be real-world software; 2) the subject
program must be written in Java; 3) the subject program’s
test suite must use the JUnit testing framework; 4) the
subject program must have a good test suite (a statement
coverage higher than 80%).

This results in Apache Commons Math, Apache Commons
Lang, Apache Commons Collections, Apache Commons
Codec and Google GSON and Guava. The dominance
of Apache projects is due to the fact that Apache is among
the very rare organizations with a very strong development
discipline.

In addition, we aim at running the whole experiments in
less than one day (24 hours). Consequently we take a single
class for each of those projects as well as all the test cases
that exercise it at least once.

Table 1 provides the descriptive statistics of our dataset.
It gives the subject program identifier, its purpose, the class
we consider, the class’ number of lines of code (LOC), the
number of tests that execute at least once one method of
the class under consideration, the statement coverage and
the total number of program variants we consider (excluding
the original program). We see that this benchmark covers
different domains, such as data encoding and collections,
and is only composed of well-tested classes. There
are between 3 and 145 computationally diverse variants of
each program to be detected (472 in total). This variation comes from
the relative difficulty of manually forging computationally
diverse variants depending on the project.

4.3 Baseline 

In the area of test suite amplification, the work by Yoo
and Harman [25] is the most closely related to our approach.
Their technique is designed for augmenting input space coverage
but can be directly applied to detecting computational
diversity. Their algorithm, called test data regeneration
(TDR for short), is based on four transformations on numerical
values in test cases: data shifting (λx.x + 1 and
λx.x − 1) and data scaling (multiply or divide the value
by 2), and on a hill-climbing algorithm based on the number of
fitness function evaluations. They consider that a test case
calls a single function, their implementation deals only with
numerical functions and they consider the numerical output
of that function as the only observation point. In our experiment,
we reimplemented the transformations on numerical
values since the tool used by Yoo and Harman is not available. We remove
the hill-climbing part since it is not relevant in our case. Analytically,
the key differences between DSpot and TDR are:
TDR stacks multiple transformations together; DSpot has
more new transformation operators on test cases; DSpot
considers a richer observation space based on arbitrary data
types and sequences of method calls.

4.4 Research Questions 

We first examine the results of our test amplification procedure.

RQ1a: what is the number of generated test cases?
We want to know whether our transformation operators on
test cases enable us to create many different new test cases,
i.e. new points in the input space. Since DSpot systematically
explores all neighbors according to the transformation
operators, we measure the number of generated test cases to
answer this basic research question.

RQ1b: what is the number of additional observation
points? In addition to creating new input points,
DSpot creates new observation points. We want to know the
order of magnitude of the number of those new observation
points. To have a clear explanation, we start by performing
only observation point amplification (without input point
amplification) and count the total number of observations.
We compare this number with the initial number of assertions,
which exactly corresponds to the original observation
points.

Then, we evaluate the ability of the amplified test suite 
to assess computational diversity. 

RQ2a: does DSpot identify more computationally
diverse programs than TDR? Now, we want to compare
our technique with the related work. We count the number
of variants that are identified as computationally different
using DSpot and TDR. The one with the highest value
is better.

RQ2b: does the efficiency of DSpot come from the
new inputs or the new observations? DSpot stacks
two techniques: the amplification of the input space and the
amplification of the observation space. To study their impact
in isolation, we count the number of computationally
diverse program variants that are detected by the original
input points equipped with new observation points and by
the amplified set of input points with the original observations.

The last research questions dig deeper into the analysis of
amplified test cases and computationally diverse variants.

RQ3a: what is the amount of natural randomness
in computation? Recall that DSpot removes some
observation points that naturally vary even on the same
program. This phenomenon is due to the natural randomness
of computation. To answer this question quantitatively,
we count the number of discarded observation points; to
answer it qualitatively, we discuss one case study.

RQ3b: what is the richness of computational diversity?
Now, we want to understand the reasons behind
the computational diversity we observe. We take a random
sample of three pairs of computationally diverse program
variants and analyze them. We discuss our findings.

4.5 Empirical Results 

We now discuss the empirical results obtained by applying
DSpot to our dataset.

4.5.1 # of Generated Test Cases 

Table 2 presents the key statistics of the amplification process.
The rows of this table go in pairs: one that provides
data for one subject program and the following one that provides
the same data gathered with the test suite amplified
by DSpot. Columns 2 to 5 are organized in two groups:
the first group gives a static view on the test suites (e.g. how
many test methods are declared); the second group draws
a dynamic picture of the test suites under study (e.g. how
many assertions are executed).

Indeed, in real, large-scale programs, test cases are modular.
Some test cases are used multiple times because they
are called by other test cases. For instance, a test case that
specifies a contract on a collection is called when testing all
implementations of collections (ArrayList, LinkedList, etc.).
We call them generic tests.
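The gap between declared and executed tests can be illustrated with a minimal sketch. It does not use JUnit, and the class and method names (GenericTestSketch, testAddIncreasesSize) are hypothetical: one generic test is declared once, but executed once per collection implementation.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.function.Supplier;

final class GenericTestSketch {
    // Counts how many times a test body is actually run.
    static int executedTests = 0;

    // One declared generic test: a contract that any List implementation must satisfy.
    static void testAddIncreasesSize(Supplier<List<String>> factory) {
        executedTests++;
        List<String> list = factory.get();
        list.add("x");
        if (list.size() != 1) {
            throw new AssertionError("size should be 1 after one add");
        }
    }

    public static void main(String[] args) {
        // The same generic test is executed for each implementation.
        testAddIncreasesSize(ArrayList::new);
        testAddIncreasesSize(LinkedList::new);
        System.out.println("declared: 1, executed: " + executedTests);
    }
}
```

In this sketch the static count is 1 while the dynamic count is 2, which is the same effect, at a much smaller scale, as the one reported in the dynamic columns of the table.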

Let’s first concentrate on the static values. Column 2 gives 
the number of test cases in the original and amplified test 
suites, while column 3 gives the number of assertions in the 
original test suites and the number of observations in the 
amplified. 

One can see that our amplification process is massive. We
create between 4x and 12x more test cases than the original
test suites. For instance, the test suite considered for
commons.codec contains 72 test cases. DSpot produces an
amplified test suite that contains 672 test methods: 9x more
than the original test suite. The original test suite observes
the state of the program with 509 assertions, while DSpot
employs 10597 observation points to detect computational
differences.

Let us now consider the dynamic part of the table.
Column 4 gives the number of tests executed (#TC exec.) and
column 5 the number of assertions executed or the number
of observation points executed. Column 6 gives the number
of observation points discarded because of natural
variations (discussed in more detail in Section 4.5.4). As we
can see, the number of generated tests (#ATC exec.) is
impacted by amplification. For instance, for commons.collection
there are 1291 tests in the amplified test suite, but
altogether, 9202 test cases are executed. The reason is that we
synthesize new test cases that use other generic test
methods. Consequently, this increases the number of executed
generic test methods, which is included in our count.

Table 3: The effectiveness of computational diversity detection

Our test case transformations yield a rich exploration of
the input space. Columns 7 to 11 of Table 2 provide deeper
insights about the synthesized test cases. Column 7 gives the
branch coverage of the original test suites and the amplified
ones (lines with *-DSPOT identifiers). Although the original
test suites have a very high branch coverage rate, DSpot is
still able to generate new tests that cover a few previously
uncovered branches. For instance, the amplified test suite
for commons-io/FileUtils reaches 7 branches that were not
executed by the original test suite. Meanwhile, the original
test suite for guava/HashBiMap already covers 90% of the
branches and DSpot did not generate test cases that cover
new branches.

The richness of the amplified test suite is also revealed in
the last column of the table (path coverage): it provides the
cumulative number of different paths executed by the test
suite in all methods under test. The amplified test suites
cover many more paths than the original ones, which means
that they trigger a much wider set of executions of the class
under test than the original test suites. For instance, for
Guava, the total number of different paths covered in the
methods under test increases from 84 to 137. This means
that, while the amplified test suite does not cover many new
branches, it executes the parts that were already covered
in many novel ways, increasing the diversity of executions
that are tested. There is one extreme case in the encode
method of commons-codec: the original test suite covers
780 different paths in this method, while the amplified test
suite covers 11356 different paths. This phenomenon is due
to the complex control flow of the method and to the fact
that its behavior directly depends on the value of an array
of bytes that takes many new values in the amplified test
suite.
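The difference between covering new branches and covering new paths can be illustrated with a minimal sketch. The class PathVsBranch and its method classify are hypothetical and unrelated to the subject programs; a path is simply recorded as the sequence of branch outcomes taken during one call, which is an assumption about how paths could be counted, not a description of our tooling.

```java
import java.util.HashSet;
import java.util.Set;

final class PathVsBranch {
    // Branch outcomes seen so far (e.g. "b1:true") and whole paths (e.g. "b1:true;b2:false;").
    static final Set<String> branches = new HashSet<>();
    static final Set<String> paths = new HashSet<>();

    // Hypothetical method under test with two independent conditions.
    static int classify(int a, int b) {
        StringBuilder path = new StringBuilder();
        int result = 0;
        boolean first = a > 0;
        record(path, "b1", first);
        if (first) {
            result += 1;
        }
        boolean second = b > 0;
        record(path, "b2", second);
        if (second) {
            result += 2;
        }
        paths.add(path.toString());
        return result;
    }

    private static void record(StringBuilder path, String id, boolean outcome) {
        String branch = id + ":" + outcome;
        branches.add(branch);
        path.append(branch).append(';');
    }

    public static void main(String[] args) {
        // "Original" inputs: all four branch outcomes are covered through two paths.
        classify(1, 1);
        classify(-1, -1);
        System.out.println(branches.size() + " branches, " + paths.size() + " paths");
        // "Amplified" inputs: no new branch outcome, but two new paths.
        classify(1, -1);
        classify(-1, 1);
        System.out.println(branches.size() + " branches, " + paths.size() + " paths");
    }
}
```

The first line printed is "4 branches, 2 paths" and the second is "4 branches, 4 paths": new inputs can exercise already covered code in new ways without increasing branch coverage.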


The amplification process is massive and produces
rich new input points: the number of declared and
executed test cases and the diversity of executions from
test cases increase.


4.5.2 # of Generated Observation Points 

Now we focus on the observation points. The fourth
column of Table 2 gives the number of assertions in the original
test suites. This corresponds to the number of locations where
the tester specifies expected values about the state of the
program execution. The fifth column gives the number of
observation points in the amplified test suite. We do not call
them assertions since they do not contain an expected value,
i.e., there is no oracle. Recall that we use those observation
points to compare the behavior of two program variants in
order to assess the computational diversity.

As we can see, we observe the program state on many
more observation points than the original assertions. As
discussed in Section 2.2, those observation points use the API
of the program under consideration, and hence can reveal
visible and exploitable computational diversity. However,
this number also encompasses the observation points on the
newly generated test cases.

Footnote (encode method, Section 4.5.1): line 331 in the Base64 class,
https://github.com/apache/commons-codec/blob/ca8968be63712c1dcce006a6d6ee9ddcef0e0a51/src/main/java/org/apache/commons/codec/binary/Base64.java

If we look at the dynamic perspective (second part of
Table 2), one observes the same phenomenon as for test cases
and assertions: there are many more points actually
observed during test execution than statically declared ones.
The reasons are identical: many observation points are in
generic test methods that are executed several times, or are
within loops in test code.


These results validate our initial intuition that a test 
suite only covers a small portion of the observation 
space. It is possible to observe the program state from 
many other observation points. 


4.5.3 Effectiveness 

We want to assess whether our method is effective for
identifying computationally diverse program variants. As ground
truth, we have the forged variants for which we know that
they are NVP-diverse (see Section 4.1); their numbers are
given in the descriptive Table 1. The benchmark is publicly
available at http://diversify-project.eu/data/

We run DSpot and TDR to see whether those two
techniques are able to detect the computationally diverse
programs. Table 3 gives the results of this evaluation. The first
column contains the name of the subject program. The
second column gives the number of variants detected by DSpot.
The third column gives the number of variants detected by
TDR. The last three columns explore in more depth whether
computational diversity is revealed by new input points,
new observation points, or both; we will come back to them
later.

As we can see, DSpot is capable of detecting all
computationally diverse variants of our benchmark. On the contrary,
the baseline technique, TDR, is always worse. Either it
detects only a fraction of them (e.g. 10/12 for commons.codec)
or none at all. The reason is that TDR, as originally
proposed by Yoo and Harman, focuses on simple programs with
shallow input spaces (one single method with integer
arguments). On the contrary, DSpot is designed to handle rich
input spaces, including constructor calls, method invocations
and strings. This has a direct impact on the effectiveness of
detecting computational diversity in program variants.

Our technique is based on two insights: the amplification
of the input space and the amplification of the observation
space. We now want to understand the impact of each of
them. To do so, we disable one or the other kind of
amplification and measure the number of detected variants. The
result of this experiment is given in the last two columns of
Table 3. Column “input space effect” gives the number of
variants that are detected only by the exploration of the
input space (i.e. by observing the program state only with the
observation method used in the original assertions). Column
“observation space effect” gives the number of variants that
are detected only by the exploration of the observation space
(i.e. by observing the result of method calls on the objects
involved in the test). For instance, for commons-codec, all
variants (12/12) are detected by exploring the input space,
and 10/12 are detected by exploring the observation space.
This means that 10 of them are detected either
by one exploration or the other one. On the contrary, for
guava, only the exploration of the observation space enables
DSpot to detect the three computationally diverse variants
of our benchmark.

By comparing columns “input space effect” and “observation
space effect”, one sees that our two explorations are not
mutually exclusive and are complementary. Some variants
are detected by both kinds of exploration (as in the case of
commons-codec). For some subjects, only the exploration
of the input space is effective (e.g. commons-lang), while
for others (guava), this is the opposite. Globally, the
exploration of the input space is more efficient: most variants are
detected this way.

Let us now consider the last column of Table 3. It gives
the mean number of observation points for which we observe
a difference between the original program and the variant
to be detected. For instance, among the 12 variants for
commons.codec, there are on average 21.9 observation points
for which there is a difference. Those numbers are high,
showing that the observation points are not independent.
Many of the methods we call to observe the program state
inspect a different facet of the same state. For instance, in
a list, the methods isEmpty() and size() are semantically
correlated.
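A minimal sketch shows why a single difference in state tends to surface on several observation points at once. The class ObservationDiff and the choice of three observation methods are assumptions made for illustration; the two lists stand for the state reached by an original program and by a hypothetical variant on the same input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ObservationDiff {
    // Observes a list through three API methods; keys name the observation points.
    static Map<String, String> observe(List<String> list) {
        Map<String, String> observations = new LinkedHashMap<>();
        observations.put("isEmpty", String.valueOf(list.isEmpty()));
        observations.put("size", String.valueOf(list.size()));
        observations.put("toString", list.toString());
        return observations;
    }

    // Returns the observation points whose values differ between the two executions.
    static List<String> differingPoints(Map<String, String> original, Map<String, String> variant) {
        List<String> differing = new ArrayList<>();
        for (Map.Entry<String, String> entry : original.entrySet()) {
            if (!entry.getValue().equals(variant.get(entry.getKey()))) {
                differing.add(entry.getKey());
            }
        }
        return differing;
    }

    public static void main(String[] args) {
        List<String> original = new ArrayList<>();                  // state on the original program
        List<String> variant = new ArrayList<>(Arrays.asList("x")); // state on a hypothetical variant
        // A variant counts as detected when at least one observation point differs.
        System.out.println(differingPoints(observe(original), observe(variant)));
    }
}
```

Here a single extra element makes all three observation points differ, so the program prints [isEmpty, size, toString].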


The systematic exploration of the input and the
observation spaces is effective at detecting behavioral
diversity between program variants.


4.5.4 Natural Randomness of Computation 

When experimenting with DSpot on real programs, we
noticed that some observation points naturally vary even
when running the same test case several times on the same
program. For instance, a hashcode that takes into account
a random salt can be different between two runs of the same
test case. We call this effect the “natural randomness” of
test case execution.

We distinguish two kinds of natural variations in the
execution of test suites. First, some observation points vary
over time when the test case is executed several times in the
same environment (same machine, OS, etc.). This is the case
for the hashcode example. Second, some observation points
vary depending on the execution environment. For instance,
if one adds an observation point on a file name, the path
name convention is different on Unix and Windows systems.
If method getAbsolutePath is an observation point, it may
return "/tmp/foo.txt" on Unix and "C:\tmp\foo.txt" on
Windows. While the first example is pure randomness, the
second only refers to variations in the runtime environment.

Interestingly, this natural randomness is not problematic
in the case of the original test suites, because it remains
below the level of observation of the oracles (the test suite
assertions in JUnit test suites). However, in our case, if one
keeps an observation point that is impacted by some natural
randomness, this would produce a false positive for
computational diversity detection. Hence, as explained in
Section 3, one phase of DSpot consists in detecting the natural
randomness first and discarding the impacted observation
points.
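The discarding step can be sketched as follows, under simplifying assumptions: the same test is run several times on a single program and a single environment, each run yields a map from observation point to observed value, and a point is discarded as soon as two runs disagree. The class NaturalRandomnessFilter and the sample test are hypothetical, and the sketch does not cover variations across operating systems.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Supplier;

final class NaturalRandomnessFilter {
    private static int runCounter = 0;

    // Runs the same observation-producing test several times and returns the names
    // of the observation points whose value was not identical across all runs.
    static Set<String> unstablePoints(Supplier<Map<String, String>> test, int runs) {
        Map<String, Set<String>> seen = new HashMap<>();
        for (int i = 0; i < runs; i++) {
            test.get().forEach((point, value) ->
                    seen.computeIfAbsent(point, k -> new HashSet<>()).add(value));
        }
        Set<String> unstable = new TreeSet<>();
        seen.forEach((point, values) -> {
            if (values.size() > 1) {
                unstable.add(point);
            }
        });
        return unstable;
    }

    // Hypothetical amplified test: "size" is deterministic, "salt" changes on every run
    // (a stand-in for, e.g., a hash code computed with a random salt).
    static Map<String, String> sampleTest() {
        Map<String, String> observations = new HashMap<>();
        observations.put("size", "3");
        observations.put("salt", String.valueOf(runCounter++));
        return observations;
    }

    public static void main(String[] args) {
        System.out.println(unstablePoints(NaturalRandomnessFilter::sampleTest, 100));
    }
}
```

With this sample test, only the "salt" point is reported as unstable; it would be excluded before comparing the original program with a variant, while "size" would be kept.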

void testCanonicalEmptyCollectionExists() {
  if (supportsEmptyCollections() && isTestSerialization()
      && !skipSerializedCanonicalTests()) {
    Object object = makeObject();
    if (object instanceof Serializable) {
      String name = getCanonicalEmptyCollectionName(object);
      File f = new java.io.File(name);
      // observation on f
      Logger.logAssertArgument(f.getCanonicalPath());
      Logger.logAssertArgument(f.getAbsolutePath());
    }
  }
}

Listing 3: An amplified test case with observation points
that naturally vary, hence are discarded by DSpot

Our experimental protocol enables us to quantify the
number of discarded observation points. The 6th column of
Table 2 gives this number. For instance, for commons-codec,
DSpot detects 12 observation points that naturally
vary. This column shows two interesting facts. First, there
is a large variation in the number of discarded observation
points: it goes up to 54313 for commons-io. This case,
together with JGIT (the last line), is due to the heavy
dependency of the library on the underlying file system
(commons-io is about IO, hence file system, operations; JGIT is
about manipulating GIT versioning repositories that are also
stored on the local file system).

Second, there are two subject programs (commons-collections
and guava) for which we discard no points at all. In those
programs, DSpot does not detect a single point that
naturally varies when running the test suite 100 times on three
different operating systems. The reason is that the API of
those subject programs does not allow inspecting the
internals of the program state up to the naturally varying parts
(e.g. the memory addresses). We consider this a good sign, as
it shows that the encapsulation is good: more than providing
an intuitive API, more than providing a protection against
future changes, it also completely encapsulates the natural
randomness of the computation.

Let us now consider a case study. Listing 3 shows an
example of an amplified test with observation points for
Apache Commons Collection. There are 12 observation
methods that can be called on the object f, an instance of File (11
getter methods and toString). The listing shows two getter
methods that return different values from one run to another
(there are 5 getter methods with that kind of behavior for
a File object). We ignore these observation points when
comparing the original program with the variants.
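One way to enumerate such observation methods is reflection. The sketch below is only an illustration under an assumed selection rule (public, non-static, parameterless get* methods declared by the class itself, plus toString); the class name ObservationSketch is hypothetical and the rule is not claimed to be the one implemented in DSpot.

```java
import java.io.File;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Map;
import java.util.TreeMap;

final class ObservationSketch {
    // Assumed rule: public, non-static, parameterless, non-void methods
    // whose name starts with "get", plus toString().
    static boolean isObservationMethod(Method m) {
        return Modifier.isPublic(m.getModifiers())
                && !Modifier.isStatic(m.getModifiers())
                && m.getParameterCount() == 0
                && m.getReturnType() != void.class
                && (m.getName().startsWith("get") || m.getName().equals("toString"));
    }

    // Calls every observation method declared by the target's class
    // and records the returned values by method name.
    static Map<String, String> observe(Object target) {
        Map<String, String> values = new TreeMap<>();
        for (Method m : target.getClass().getDeclaredMethods()) {
            if (!isObservationMethod(m)) {
                continue;
            }
            try {
                values.put(m.getName(), String.valueOf(m.invoke(target)));
            } catch (ReflectiveOperationException e) {
                // An exception is also an observable behavior, so it is recorded.
                values.put(m.getName(), "exception: " + e.getCause());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        observe(new File("foo.txt"))
                .forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Running the sketch on a File prints one line per observation method. Values such as the one returned by getAbsolutePath depend on the working directory and the operating system, which is the kind of observation that the filtering phase is meant to discard.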


The systematic exploration of the observable output
space provides new insights about the degree of
encapsulation of a class. When a class gives public access to
variables that naturally vary, there is a risk that when
used in oracles, they result in flaky test cases.


4.5.5 Nature of Computational Diversity 
Now we want to understand more in depth the nature of 
the NVP-diversity we are observing. Let us discuss three 
case studies. 

Listing 4 shows two variants of the writeStringToFile()
method of Apache Commons IO. The original program calls
openOutputStream, which checks different things about the
file name, while the variant directly calls the constructor of
FileOutputStream. These two variants behave differently
outside the specified domain: in case writeStringToFile()
is called with an invalid file name, the original program
handles it, while the variant throws a FileNotFoundException.
Our test transformation operator on String values produces
such a file name, as shown in the test case of Listing 5: one
character of the file name is changed into “/”, which makes
the file name an invalid one. Running this test on the variant
results in a FileNotFoundException.

// original program
void writeStringToFile(File file, String data,
    Charset encoding, boolean append) throws IOException {
  OutputStream out = null;
  out = openOutputStream(file, append);
  IOUtils.write(data, out, encoding);
  out.close();
}

// variant
void writeStringToFile(File file, String data,
    Charset encoding, boolean append) throws IOException {
  OutputStream out = null;
  out = new FileOutputStream(file, append);
  IOUtils.write(data, out, encoding);
  out.close();
}

Listing 4: Two variants of writeStringToFile in commons.io

void testCopyDirectoryPreserveDates() {
  try {
    File sourceFile = new File(sourceDirectory, "hello/txt");
    FileUtils.writeStringToFile(sourceFile, "HELLO WORLD", "UTF8");
  } catch (Exception e) {
    DSpot.observe(e.getMessage());
  }
}

Listing 5: Amplified test case that reveals computational
diversity between variants of Listing 4
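The behavioral difference can be reproduced with the standard library alone. In the sketch below, writeWithChecks is a hypothetical stand-in for the original program and assumes that the check performed before opening the stream creates missing parent directories; writeDirectly is a stand-in for the variant. Neither method is the Commons IO code.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

final class OpenStreamVariants {
    // Stand-in for the original program: creates missing parent directories first.
    static String writeWithChecks(File file, String data) {
        try {
            File parent = file.getParentFile();
            if (parent != null && !parent.mkdirs() && !parent.isDirectory()) {
                throw new IOException("Directory '" + parent + "' could not be created");
            }
            try (OutputStream out = new FileOutputStream(file)) {
                out.write(data.getBytes(StandardCharsets.UTF_8));
            }
            return "ok";
        } catch (IOException e) {
            return e.getClass().getSimpleName();
        }
    }

    // Stand-in for the variant: opens the stream directly.
    static String writeDirectly(File file, String data) {
        try (OutputStream out = new FileOutputStream(file)) {
            out.write(data.getBytes(StandardCharsets.UTF_8));
            return "ok";
        } catch (IOException e) {
            return e.getClass().getSimpleName();
        }
    }

    // Creates an empty temporary directory so that each call starts from the same state.
    static File freshDirectory() {
        try {
            return Files.createTempDirectory("dspot-sketch").toFile();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The mutated name "hello/txt" denotes a file inside a directory that does not exist yet.
        File forOriginal = new File(freshDirectory(), "hello/txt");
        File forVariant = new File(freshDirectory(), "hello/txt");
        System.out.println("original: " + writeWithChecks(forOriginal, "HELLO WORLD"));
        System.out.println("variant:  " + writeDirectly(forVariant, "HELLO WORLD"));
    }
}
```

The first call succeeds and the second reports a FileNotFoundException, because FileOutputStream does not create the missing "hello" directory. An observation point on the exception, as in Listing 5, is enough to tell the two behaviors apart.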

Let us now consider Listing 6, which shows two variants
of the toJson() method from the Google Gson library. The
last statement of the original method is replaced by another
one: instead of setting the serialization format of the writer,
it sets the indent format. Each variant creates a JSON with
slightly different formats, and none of these formatting
decisions are part of the specified domain (and actually,
specifying the exact formatting of the JSON String could be
considered as over-specification). The diversity among variants is
detected by the test case displayed in Listing 7, which adds
an observation point (a call to toString()) on instances of
StringWriter, which are modified by toJson().


// Original program
void toJson(Object src, Type typeOfSrc, JsonWriter writer) {
  writer.setSerializeNulls(oldSerializeNulls);
}

// variant
void toJson(Object src, Type typeOfSrc, JsonWriter writer) {
  writer.setIndent(" ");
}

Listing 6: Two variants of toJson in GSON




public void testWriteMixedStreamed_remove534() throws IOException {
  gson.toJson(RED_MIATA, Car.class, jsonWriter);
  jsonWriter.endArray();
  Logger.logAssertArgument(com.google.gson.MixedStreamTest.CARS_JSON);
  Logger.logAssertArgument(stringWriter.toString());
}

Listing 7: Amplified test detecting black-box diversity
among variants of Listing 6

// Original program
void decode(final byte[] in, int inPos, final int inAvail,
    final Context context) {
  switch (context.modulus) {
    case 0: // impossible, as excluded above
    case 1: // 6 bits - ignore entirely
      // not currently tested; perhaps it is impossible?
      break;
  }
}


The next case study is in Listing 8: two variants of the
method decode() in the Base64 class of the Apache Commons
Codec library. The original program has a switch-case
statement in which case 1 executes a break. An original
comment by the programmers indicates that it is probably
impossible. The test case in Listing 9 amplifies one of the
original test cases with a mutation on the String value in the
encodedInt3 variable (the original String has an additional
‘\’ character, removed by the “remove character”
transformation). The amplification on the observation points adds
multiple observation points. The single observation point
shown in the listing is the one that detects computational
diversity: it calls the static decodeInteger() method, which
returns 1 on the original program and 0 on the variant. In
addition to validating our approach, this example
anecdotally answers the question of the programmer: case 1 is
possible, and it can be triggered from the API.
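A “remove character” transformation on a String literal can be sketched as follows. The class name RemoveCharacterOperator is hypothetical, and enumerating every single-character removal is an assumption made for illustration; an actual amplification step may pick only one position.

```java
import java.util.ArrayList;
import java.util.List;

final class RemoveCharacterOperator {
    // Returns every string obtained by deleting exactly one character of the literal.
    static List<String> mutants(String literal) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < literal.length(); i++) {
            result.add(literal.substring(0, i) + literal.substring(i + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Each mutant would replace the original literal in a copy of the test case.
        System.out.println(mutants("a/b"));
    }
}
```

For the literal "a/b" the sketch prints [/b, ab, a/]. Each mutant yields a new input point that is close to an input the developers considered relevant, yet outside the specified domain.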

These three case studies are meant to give the reader
a better idea of how DSpot was able to detect the variants.
We discuss how augmented test cases reveal this diversity
(both with amplified inputs and observation points). We
illustrate three categories of code variations that maintain
the expected functionality as specified in the test suite, but
still induce diversity (different checks on input, different
formatting, different handling of special cases).


The diversity that we observe originates from areas
of the code that are characterized by their flexibility
(caching, checking, formatting, etc.). These areas are
very close to the concept of forgiving region proposed
by Martin Rinard [21].


// variant
void decode(final byte[] in, int inPos, final int inAvail,
    final Context context) {
  switch (context.modulus) {
    case 0: // impossible, as excluded above
    case 1:
  }
}

Listing 8: Two variants of decode in commons.codec

@Test
void testCodeInteger3_literalMutation222() {
  String encodedInt3 =
      "FKIhdgaG5LGKiEtFlvHy4f3y700zaD6QwDS3IrNVGzNp2"
      + "rY+1LFWTK6D44AyiCln8uWz1itkYMZFOaKDKOYjg==";
  Logger.logAssertArgument(Base64.decodeInteger(
      encodedInt3.getBytes(Charsets.UTF_8)));
}

Listing 9: Amplified test case that reveals the
computational diversity between variants of Listing 8

4.6 Threats to Validity

DSpot is able to effectively detect NVP-diversity using
test suite amplification. Our experimental results are
subject to the following threats.

First, this experiment is computationally intensive: a bug in
our evaluation code may invalidate our findings. However,
since we have manually checked a sample of cases (the case
studies of Section 4.5.4 and Section 4.5.5), we have high
confidence in our results. Our implementation is publicly
available at
http://diversity-project.github.io/test-suite-amplification.html

Second, we have forged the computationally diverse
program variants. As shown in Table 3, our technique
DSpot is able to detect them all. The reason is that
we had a bias towards our technique when forging those
variants. This is true for all self-made evaluations. This
threat on the results of the comparative evaluation against
TDR is mitigated by the analytical comparison of the two
approaches. Both the input space and the output space of
TDR (respectively an integer tuple and a returned value) are
simpler and less powerful than those of our amplification
technique.

Third, our experiments consider one programming
language (Java) and 7 different application domains. To further
assess the external validity of our results, new experiments
are required on different technologies and more application
domains.


5. RELATED WORK 

The work presented is related to two main areas: the
identification of similarities or diversity in source code and the
automatic augmentation of test suites.

Computational diversity The recent work by Carzaniga
et al. [3] has a similar intent as ours: automatically
identifying dissimilarities in the execution of code fragments that are
functionally similar. They use random test cases generated
by Evosuite to get execution traces and log the internals of
the execution (executed code and the read/write operations
on data). The main difference with our work is that they
assess computational diversity with random testing
instead of test amplification.

Koopman and DeVale [15] aim at quantifying the
diversity among a set of implementations of the POSIX operating
system, with respect to their responses to exceptional
conditions. Diversity quantification in this context is used to
detect which versions of POSIX provide the most different
failure profiles and should thus be assembled to ensure fault
tolerance. Their approach relies on Ballista to generate
millions of input data and the outputs are analyzed to quantify
the difference. This is an example of diversity assessment
with intensive fuzz testing and observation points on
crashing states.

Many other works look for semantic equivalence or
diversity through static or dynamic analysis. Gabel and Su [7]
investigate the level of granularity at which diversity emerges
in source code. Their main finding is that, for sequences
up to 40 tokens, there is a lot of redundancy. Beyond this
(of course fuzzy) threshold, the diversity and uniqueness
of source code appear. Higo and Kusumoto [11] investigate
the interplay between structural similarity, vocabulary
similarity and method name similarity, to assess functional
similarity between methods in Java programs. They show
that many contextual factors influence the ability of these
similarity measures to spot functional similarity (e.g., the
number of methods that share the same name, or the fact
that two methods with similar structure are in the same
class or not). Jiang and Su [12] extract code fragments of
a given length and randomly generate input data for these
snippets. Then, they identify the snippets that produce the
same output values (which are considered functionally
equivalent, w.r.t. the set of random test inputs). They show that
this method identifies redundancies that static clone
detection does not find. Kawaguchi and colleagues [13] focus on
the introduction of changes that break the interface
behavior. They also use a notion of partial equivalence, where “two
versions of a program need only be semantically equivalent
under a subset of all inputs”. Gao and colleagues [8]
propose a graph-based analysis to identify semantic differences
in binary code. This work is based on the extraction of call
graphs and control flow graphs of both variants and on
comparisons between these graphs in order to spot the semantic
variations. Person and colleagues [19] developed differential
symbolic execution, which can be used to detect and
characterize behavioral differences between program versions.

Test suite amplification In the area of test suite
amplification, the work by Yoo and Harman [25] is the most
closely related to our approach, and we used it as the baseline
for computational diversity assessment. They amplify test
suites only with transformations on integer values, while we
also transform boolean and String literals, as well as
statements in test cases. Yoo and Harman also have two additional
parameters for test case transformation: the interaction level
that determines the number of simultaneous transformations
on the same test case, and the search radius that bounds
their search process when trying to improve the effectiveness
of augmented test suites. Their original intent is to increase
the input space coverage to improve test effectiveness. They
do not handle the oracle problem in that work.
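To make the notion of transformations on integer inputs concrete, the sketch below derives new integer tuples from an existing one by changing a single position at a time, which corresponds to an interaction level of one. The four arithmetic operators (x+1, x-1, x*2, x/2) and the class name IntegerTupleNeighbours are assumptions chosen for illustration and are not claimed to be the exact operators of Yoo and Harman.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class IntegerTupleNeighbours {
    // Returns the tuples obtained by transforming exactly one position of the input.
    static List<int[]> neighbours(int[] input) {
        List<int[]> result = new ArrayList<>();
        for (int i = 0; i < input.length; i++) {
            int x = input[i];
            for (int y : new int[] {x + 1, x - 1, x * 2, x / 2}) {
                if (y == x) {
                    continue; // skip transformations that leave the value unchanged
                }
                int[] copy = input.clone();
                copy[i] = y;
                result.add(copy);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (int[] tuple : neighbours(new int[] {3, 0})) {
            System.out.println(Arrays.toString(tuple));
        }
    }
}
```

For the tuple (3, 0), the sketch produces six new tuples: four from the first position and two from the second, since doubling or halving 0 leaves it unchanged. Such an input space is much shallower than one that also includes constructor calls, method invocations and String values.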

Xie [23] augments test suites for Java programs with new
test cases that are automatically generated, and he
automatically generates assertions for these new test cases, which
can check for regression errors. Harder et al. [9] propose
to retrieve operational abstractions, i.e., invariant properties
that hold for a set of test cases. These abstractions are then
used to compute operational differences, which detect
diversity among a set of test cases (and not among a set of
implementations as in our case). While the authors mention
that operational differencing can be used to augment a test
suite, the generation of new test cases is out of this work’s
scope. Zhang and Elbaum [26] focus on test cases that verify
error handling code. Instead of directly amplifying the test
cases as we propose, they transform the program under test:
they instrument the target program by mocking the external
resource that can throw exceptions, which allows them to
amplify the space of exceptional behaviors exposed to the
test cases. Pezze et al. [20] use the information provided
in unit test cases about object creation and initialization
to build composite test cases that focus on interactions
between classes. Their main result is that the new test cases
find faults that could not be revealed by the unit test cases
that provided the basic material for the synthesis of
composite test cases. Xu et al. [24] refer to “test suite augmentation”
as the following process: in case a program P evolves into P’,
identify the parts of P’ that need new test cases and
generate these tests. They combine concolic and search-based test
generation to automate this process. This hybrid approach
is more effective than each technique separately, but with
increased costs. Dallmeier et al. [2] automatically amplify test
suites by adding and removing method calls in JUnit test
cases. Their objective is to produce test cases that cover a
wider set of execution states than the original test suite in
order to improve the quality of models reverse engineered
from the code.


6. CONCLUSION 

In this paper, we have presented DSpot, a novel technique
for detecting one kind of computational diversity between a
pair of programs. This technique is based on test suite
amplification: the automatic transformation of the original test
suite. DSpot uses two kinds of transformations, for
respectively exploring new points in the program’s input space and
exploring new observation points on the execution state,
after execution with the given input points.

Our evaluation on large open-source projects shows that
test suites amplified by DSpot are capable of assessing
computational diversity and that our amplification strategy is
better than the closest related work, a technique called TDR
by Yoo and Harman [25]. We have also presented a deep
qualitative analysis of our empirical findings. Behind the
performance of DSpot, our results shed an original light on
the specified and unspecified parts of real-world test suites
and the natural randomness of computation.

This opens avenues for future work. There is a relation
between the natural randomness of computation and the
so-called flaky tests (those tests that occasionally fail). To us,
the assertions of the flaky tests are at the border of the
naturally nondeterministic parts of the execution: sometimes they
hit it, sometimes they don’t. With such a view, we
imagine an approach that characterizes this limit and proposes
an automatic refactoring of the flaky tests so that they get
farther from the limit of the natural randomness and enter
again into the good, old and reassuring world of
determinism.